Goto

Collaborating Authors

 McKenzie County


Minimum Wasserstein distance estimator under covariate shift: closed-form, super-efficiency and irregularity

Lang, Junjun, Zhang, Qiong, Liu, Yukun

arXiv.org Machine Learning

Covariate shift arises when covariate distributions differ between source and target populations while the conditional distribution of the response remains invariant, and it underlies problems in missing data and causal inference. We propose a minimum Wasserstein distance estimation framework for inference under covariate shift that avoids explicit modeling of outcome regressions or importance weights. The resulting W-estimator admits a closed-form expression and is numerically equivalent to the classical 1-nearest neighbor estimator, yielding a new optimal transport interpretation of nearest neighbor methods. We establish root-$n$ asymptotic normality and show that the estimator is not asymptotically linear, leading to super-efficiency relative to the semiparametric efficient estimator under covariate shift in certain regimes, and uniformly in missing data problems. Numerical simulations, along with an analysis of a rainfall dataset, underscore the exceptional performance of our W-estimator.


Unsupervised Document and Template Clustering using Multimodal Embeddings

Sampaio, Phillipe R., Maxcici, Helene

arXiv.org Artificial Intelligence

We study unsupervised clustering of documents at both the category and template levels using frozen multimodal encoders and classical clustering algorithms. We systematize a model-agnostic pipeline that (i) projects heterogeneous last-layer states from text-layout-vision encoders into token-type-aware document vectors and (ii) performs clustering with centroid- or density-based methods, including an HDBSCAN + $k$-NN assignment to eliminate unlabeled points. We evaluate eight encoders (text-only, layout-aware, vision-only, and vision-language) with $k$-Means, DBSCAN, HDBSCAN + $k$-NN, and BIRCH on five corpora spanning clean synthetic invoices, their heavily degraded print-and-scan counterparts, scanned receipts, and real identity and certificate documents. The study reveals modality-specific failure modes and a robustness-accuracy trade-off, with vision features nearly solving template discovery on clean pages while text dominates under covariate shift, and fused encoders offering the best balance. We detail a reproducible, oracle-free tuning protocol and the curated evaluation settings to guide future work on unsupervised document organization.




FLOWR: Flow Matching for Structure-Aware De Novo, Interaction- and Fragment-Based Ligand Generation

Cremer, Julian, Irwin, Ross, Tibo, Alessandro, Janet, Jon Paul, Olsson, Simon, Clevert, Djork-Arné

arXiv.org Artificial Intelligence

We introduce FLOWR, a novel structure-based framework for the generation and optimization of three-dimensional ligands. FLOWR integrates continuous and categorical flow matching with equivariant optimal transport, enhanced by an efficient protein pocket conditioning. Alongside FLOWR, we present SPINDR, a thoroughly curated dataset comprising ligand-pocket co-crystal complexes specifically designed to address existing data quality issues. Empirical evaluations demonstrate that FLOWR surpasses current state-of-the-art diffusion- and flow-based methods in terms of PoseBusters-validity, pose accuracy, and interaction recovery, while offering a significant inference speedup, achieving up to 70-fold faster performance. In addition, we introduce FLOWR:multi, a highly accurate multi-purpose model allowing for the targeted sampling of novel ligands that adhere to predefined interaction profiles and chemical substructures for fragment-based design without the need of re-training or any re-sampling strategies


ROSFD: Robust Online Streaming Fraud Detection with Resilience to Concept Drift in Data Streams

Yelleti, Vivek

arXiv.org Artificial Intelligence

Continuous generation of streaming data from diverse sources, such as online transactions and digital interactions, necessitates timely fraud detection. Traditional batch processing methods often struggle to capture the rapidly evolving patterns of fraudulent activities. This paper highlights the critical importance of processing streaming data for effective fraud detection. To address the inherent challenges of latency, scalability, and concept drift in streaming environments, we propose a robust online streaming fraud detection (ROSFD) framework. Our proposed framework comprises two key stages: (i) Stage One: Offline Model Initialization. In this initial stage, a model is built in offline settings using incremental learning principles to overcome the "cold-start" problem. (ii) Stage Two: Real-time Model Adaptation. In this dynamic stage, drift detection algorithms (viz.,, DDM, EDDM, and ADWIN) are employed to identify concept drift in the incoming data stream and incrementally train the model accordingly. This "train-only-when-required" strategy drastically reduces the number of retrains needed without significantly impacting the area under the receiver operating characteristic curve (AUC). Overall, ROSFD utilizing ADWIN as the drift detection method demonstrated the best performance among the employed methods. In terms of model efficacy, Adaptive Random Forest consistently outperformed other models, achieving the highest AUC in four out of five datasets.


Online Consistency of the Nearest Neighbor Rule

Dasgupta, Sanjoy, So, Geelon

arXiv.org Machine Learning

In the realizable online setting, a learner is tasked with making predictions for a stream of instances, where the correct answer is revealed after each prediction. A learning rule is online consistent if its mistake rate eventually vanishes. The nearest neighbor rule (Fix and Hodges, 1951) is a fundamental prediction strategy, but it is only known to be consistent under strong statistical or geometric assumptions--the instances come i.i.d. or the label classes are well-separated. We prove online consistency for all measurable functions in doubling metric spaces under the mild assumption that the instances are generated by a process that is uniformly absolutely continuous with respect to a finite, upper doubling measure.


Modeling chaotic Lorenz ODE System using Scientific Machine Learning

Kashyap, Sameera S, Dandekar, Raj Abhijit, Dandekar, Rajat, Panat, Sreedath

arXiv.org Artificial Intelligence

The Lorenz system of equations is a set of ordinary differential equations to represent a simplified model of atmospheric convection Sparrow [1982]. These set of equations have a wide range of applications in fields ranging from fluid mechanics to laser physics to weather prediction. One of the most interesting properties of the Lorenz ODE System is that it is chaotic in nature Fowler et al. [1982]. Small changes in the initial conditions can lead to vastly different outcomes in the end result Liao S. [2014]. When simulated over a given period, the Lorenz ODEs show oscillations in time. Usually, numerical methods implemented in computational software modeling tools like Python, Julia, or Matlab are used to simulate the Lorenz System of ODEs. These methods are inefficient as Lorentz equations are sensitive to initial conditions and minute changes to the conditions and tiny rounding errors can lead to the accumulation of numerical errors over time. Very few studies have been aimed at integrating machine learning-aided methods in simulating the chaotic Lorenz system. In this study, we provide a robust investigation of the effect of two physics-aided machine learning models in simulating the Lorenz system of ODEs: Neural Ordinary Differential Equations (Neural ODEs) Chen et al. [2018] and Universal Differential Equations (UDEs) Rackauckas et al. [2020a].


Learning Deep Features for Scene Recognition using Places Database

Neural Information Processing Systems

Scene recognition is one of the hallmark tasks of computer vision, allowing definition of a context for object recognition. Whereas the tremendous recent progress in object recognition tasks is due to the availability of large datasets like ImageNet and the rise of Convolutional Neural Networks (CNNs) for learning high-level features, performance at scene recognition has not attained the same level of success. This may be because current deep features trained from ImageNet are not competitive enough for such tasks. Here, we introduce a new scene-centric database called Places with over 7 million labeled pictures of scenes. We propose new methods to compare the density and diversity of image datasets and show that Places is as dense as other scene datasets and has more diversity. Using CNN, we learn deep features for scene recognition tasks, and establish new state-of-the-art results on several scene-centric datasets. A visualization of the CNN layers' responses allows us to show differences in the internal representations of object-centric and scene-centric networks.


Poker Hand History File Format Specification

Kim, Juho

arXiv.org Artificial Intelligence

This paper introduces the Poker Hand History (PHH) file format, designed to standardize the recording of poker hands across different game variants. Despite poker's widespread popularity in the mainstream culture as a mind sport and its prominence in the field of artificial intelligence (AI) research as a benchmark for imperfect information AI agents, it lacks a consistent format that humans can use to document poker hands across different variants that can also easily be parsed by machines. To address this gap in the literature, we propose the PHH format which provides a concise human-readable machine-friendly representation of hand history that comprehensively captures various details of the hand, ranging from initial game parameters and actions to contextual parameters including but not limited to the venue, players, and time control information. In the supplementary, we provide over 10,000 hands covering 11 different variants in the PHH format. Building on our previous work on PokerKit, a premier poker hand simulation tool, we demonstrate the usages of our open-source Python implementation of the PHH parser. The source code of the parser is available on GitHub: https://github.com/uoftcprg/pokerkit